Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Add enumeration support #1559

Merged
merged 4 commits into from
Sep 15, 2023
Merged

[r] Add enumeration support #1559

merged 4 commits into from
Sep 15, 2023

Conversation

eddelbuettel
Copy link
Contributor

@eddelbuettel eddelbuettel commented Jul 28, 2023

Issue and/or context:

As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available

Changes:

This PR supports reading of columns with enumerations (aka dictionaries aka factor variable) directly via Arrow. Preliminary write support is also available (but still goes through the tiledb R package for writes).

Notes for Reviewer:

This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17. The branch and PR are ready but should only be merged once prequisites are been merged. It likely needs #1519 (C++ side) and #1663 (CI support).

CI is turned off as the TileDB default build is still without support for enumerations.

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #32235: [r] Support enumerations in SOMA.

@eddelbuettel
Copy link
Contributor Author

A quick snapshot of what works as of today / this branch when 2.17 is available.

Code

library(tiledbsoma)
library(tiledb)

pp <- palmerpenguins::penguins
uri <- tempfile(pattern="tiledb")

fromDataFrame(pp, uri)

## read one: soma_array_reader
p <- soma_array_reader(uri)
print(tibble::as_tibble(arrow::RecordBatch$import_from_c(p[[1]], p[[2]])))

## read twp: iterated reader for sdf
p <- SOMADataFrameOpen(uri)$read()$concat()
print(tibble::as_tibble(p))

Output

# A tibble: 344 × 9
   `__tiledb_rows` species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
             <int> <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1               1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2               2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3               3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4               4 Adelie  Torgersen           NA            NA                  NA          NA NA      2007
 5               5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
 6               6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
 7               7 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
 8               8 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
 9               9 Adelie  Torgersen           34.1          18.1               193        3475 NA      2007
10              10 Adelie  Torgersen           42            20.2               190        4250 NA      2007
# ℹ 334 more rows
# A tibble: 344 × 9
   `__tiledb_rows` species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
             <int> <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1               1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2               2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3               3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4               4 Adelie  Torgersen           NA            NA                  NA          NA NA      2007
 5               5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
 6               6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
 7               7 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
 8               8 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
 9               9 Adelie  Torgersen           34.1          18.1               193        3475 NA      2007
10              10 Adelie  Torgersen           42            20.2               190        4250 NA      2007
# ℹ 334 more rows

@eddelbuettel eddelbuettel force-pushed the de/sc-32235/enumeration_support branch 2 times, most recently from c45ee26 to 4eb3f90 Compare July 31, 2023 20:28
@eddelbuettel eddelbuettel force-pushed the de/sc-32235/enumeration_support branch 3 times, most recently from 9a4c341 to 407f83e Compare August 18, 2023 20:04
@eddelbuettel eddelbuettel force-pushed the de/sc-32235/enumeration_support branch 3 times, most recently from 8013da5 to 4668e2e Compare August 31, 2023 00:58
@johnkerl
Copy link
Member

johnkerl commented Sep 6, 2023

@eddelbuettel can you please resolve conflicts?

@eddelbuettel eddelbuettel force-pushed the de/sc-32235/enumeration_support branch from 4668e2e to 455182e Compare September 6, 2023 15:56
@eddelbuettel
Copy link
Contributor Author

eddelbuettel commented Sep 6, 2023

Rebased again (which was a little more involved than usual), passes 1930 tests here.

(I used to have all CI turned off waiting for the merge of TileDB Core, that regressed so some read X now.)

@eddelbuettel eddelbuettel force-pushed the de/sc-32235/enumeration_support branch from fd647bf to ed40f3b Compare September 6, 2023 18:53
@eddelbuettel eddelbuettel marked this pull request as ready for review September 8, 2023 13:56
@eddelbuettel eddelbuettel changed the title [r] [WIP] Add enumeration support [r] Add enumeration support Sep 8, 2023
@eddelbuettel eddelbuettel force-pushed the de/sc-32235/enumeration_support branch from ed40f3b to 7f9ee36 Compare September 8, 2023 20:52
@codecov-commenter
Copy link

codecov-commenter commented Sep 8, 2023

Codecov Report

Patch has no changes to coverable lines.

❗ Current head 513b478 differs from pull request most recent head 10a94cd. Consider uploading reports for the commit 10a94cd to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

📢 Thoughts on this report? Let us know!.

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Pending CI of course! 😎

@eddelbuettel
Copy link
Contributor Author

eddelbuettel commented Sep 14, 2023

Pending CI of course! 😎

I should probaby update the libtiledbsoma side from the rc to the release too. Or will you take of that 'soon' (pending the unit test update) ?

(And I do expect this to go ❌ on python interop as the last ones did....) And so it happens. So I thing this needs some work somewhere else first, then a rebase and then we can talk :shipit:

@johnkerl
Copy link
Member

@eddelbuettel I tried #1677 as a candidate to go underneath #1519 #1511 #1559. There's a CI issue I'm puzzling over. Let's chat in Slack.

@eddelbuettel
Copy link
Contributor Author

@johnkerl I am lost. Might be a candidate for outreach to the author of these contributed tests too.

@johnkerl
Copy link
Member

johnkerl commented Sep 14, 2023

The failing CI step is R/Python interop which is here
https://github.com/single-cell-data/TileDB-SOMA/blob/main/.github/workflows/r-python-interop-testing.yml
and it does quite simply python -m pip -v install -e apis/python, and apis/python/setup.py has tiledb (tiledb-py) dependency still 0.22.2 (you can see core 2.16.3 in the show_package_versions() output in the failed CI job). This is where we are at until TileDB-Py 0.23.0 is released. So, let's wait for that.

johnkerl added a commit that referenced this pull request Sep 14, 2023
* [c++] Support `Enumeration` in C++ Codebase

* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable;
  this is used when converting from `ArrayBuffers` to Arrow Tables in
  the Python and R APIs

* Add `get_attr_to_enum_mapping` Function

* WIP fix bug where attr name was passed instead of enum name

* Add Unit Tests for Enumeration in C++

* `to_varlen_buffers` Returns `std::string`

* Prior to TileDB-Inc/TileDB#4272, the SOMA unit
tests were erroneously writing a byte vector for string dimensions which
maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`

* Update SOMA Array get_metadata Signatures

* Depend on 2.17.0-rc0

* resolve an incomplete merge with #1559

* [r] Update tiledb-r to RC (borrowed from #1663, #1665)

* [r] Undo brown-bag typo in helper script

* 2.17.0

---------

Co-authored-by: John Kerl <[email protected]>
Co-authored-by: Dirk Eddelbuettel <[email protected]>
@eddelbuettel eddelbuettel force-pushed the de/sc-32235/enumeration_support branch from 6ca5418 to de82254 Compare September 14, 2023 21:33
@eddelbuettel
Copy link
Contributor Author

Ok, did a first rebase (which was much smother than a local one I had recently) so fingers crossed!

As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available

**Changes:**

This PR supports reading of columns with enumerations (aka dictionaries aka factor variable) directly via Arrow. Preliminary write support is also available (but still goes through the `tiledb` R package for writes).

**Notes for Reviewer:**

~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~  The branch and PR are ready but should only be merged once prequisites are been merged.  It likely needs #1519 (C++ side) and #1663 (CI support).

CI is turned off as the TileDB default build is still without support for enumerations.
This PR adds support for return Arrow tables with dictionaries that can include `ordered` enumerations.

**Changes:**

Given #1559 which it depends upon, a very small change to just three files in `libtiledbsoma`.

This should become clearer once the dependent PR is merged and can be rebased.

**Notes for Reviewer:**

[SC 34073](https://app.shortcut.com/tiledb-inc/story/34073/c-add-ordered-support-to-arrow-export)
This PR extends the `schema()` function to return an Arrow schema with enumerations including `ordered`.

**Changes:**

Given #1559 which it depends upon, a very small change to just one file.

This should become clearer once the dependent PR is merged and can be rebased.

**Notes for Reviewer:**

[SC 34074](https://app.shortcut.com/tiledb-inc/story/34074/c-add-ordered-support-to-arrow-export)
@johnkerl johnkerl mentioned this pull request Sep 15, 2023
johnkerl added a commit that referenced this pull request Sep 15, 2023
* ihn/bugfix

* unit-test update

* lint
@eddelbuettel
Copy link
Contributor Author

Ok if we all hold our breath for 20 minutes I am sure this run will turn all green....

@johnkerl
Copy link
Member

Ok if we all hold our breath for 20 minutes I am sure this run will turn all green....

@eddelbuettel That's the plan! :)

@eddelbuettel
Copy link
Contributor Author

eddelbuettel commented Sep 15, 2023

Cnttlkrtksmhldngmbrthshrd

@eddelbuettel
Copy link
Contributor Author

(That was "Can't type or talk as I am holding my breath" approximately free of vowels ... 😉 )

* ihn/bugfix

* unit-test update

* lint
@ihnorton ihnorton force-pushed the de/sc-32235/enumeration_support branch from 0d7cb02 to 10a94cd Compare September 15, 2023 21:45
@eddelbuettel
Copy link
Contributor Author

eddelbuettel commented Sep 15, 2023

Yay yay yay

Not so fast. One more.

@johnkerl johnkerl merged commit 9b8fa52 into main Sep 15, 2023
@johnkerl johnkerl deleted the de/sc-32235/enumeration_support branch September 15, 2023 22:07
mojaveazure added a commit that referenced this pull request Sep 16, 2023
Allow `write_soma.data.frame()` to write factors as categoricals. `write_soma()` will no longer cast factors to characters prior to writing
Redo of #1683 since it got removed from #1559
johnkerl pushed a commit that referenced this pull request Sep 17, 2023
* [r] Update `write_soma.data.frame()` to support categoricals
Allow `write_soma.data.frame()` to write factors as categoricals. `write_soma()` will no longer cast factors to characters prior to writing
Redo of #1683 since it got removed from #1559

* Handle enumerations in `write_soma.data.frame()`

* Add tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants